Part I - Prosper Loan Data

by Tolulope Ogunfuwa

Preliminary Wrangling

Eleven of the columns have more than 50% missing values, so the next step is to drop them.

There are many variables in this large dataset, but we will narrow our focus to a few of them.
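A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The helper name `drop_sparse_columns` and the toy values are illustrative only and are not taken from the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy of df without columns whose share of missing values exceeds threshold."""
    missing_share = df.isna().mean()  # fraction of nulls per column
    sparse_cols = missing_share[missing_share > threshold].index
    return df.drop(columns=sparse_cols)


# Toy frame with made-up values; "CreditGrade" is 75% missing here.
toy = pd.DataFrame({
    "LoanOriginalAmount": [1000, 2500, 4000, 5000],
    "CreditGrade": ["A", np.nan, np.nan, np.nan],
    "BorrowerAPR": [0.12, 0.20, np.nan, 0.30],
})

trimmed = drop_sparse_columns(toy)
print(list(trimmed.columns))
```

Computing the missing share per column first makes it easy to inspect which columns will be dropped before committing to the change.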

They are:

'ListingNumber', 'ListingCreationDate', 'LoanOriginalAmount', 'LoanStatus', 'ListingCategory (numeric)', 'BorrowerState', 'BorrowerAPR', 'BorrowerRate', 'StatedMonthlyIncome', 'ProsperRating (Alpha)', 'Occupation', 'Term', 'EmploymentStatus', 'TotalInquiries', 'DebtToIncomeRatio', 'MonthlyLoanPayment', 'TotalTrades', 'Investors'

Information on the selected variables

1. ListingNumber == The number that uniquely identifies the listing to the public as displayed on the website.

2. ListingCreationDate == The date the listing was created.

3. LoanOriginalAmount == The origination amount of the loan.

4. LoanStatus == The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

5. ListingCategory (numeric) == The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

6. BorrowerState == The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.

7. BorrowerAPR == The Borrower's Annual Percentage Rate (APR) for the loan.

8. BorrowerRate == The Borrower's interest rate for this loan.

9. StatedMonthlyIncome == The monthly income the borrower stated at the time the listing was created.

10. ProsperRating (Alpha) == The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.

11. Occupation == The Occupation selected by the Borrower at the time they created the listing.

12. Term == The length of the loan expressed in months.

13. EmploymentStatus == The employment status of the borrower at the time they posted the listing.

14. TotalInquiries == Total number of inquiries at the time the credit profile was pulled.

15. DebtToIncomeRatio == The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).

16. MonthlyLoanPayment == The scheduled monthly loan payment.

17. TotalTrades == Number of trade lines ever opened at the time the credit profile was pulled.

18. Investors == The number of investors that funded the loan.

The properties of the selected variables show that further wrangling is needed.

ProsperRating (Alpha) is an integral variable for the analysis, and 25% of its values are null. That is a significant share compared to the other variables, which have fewer than 10% null values. To work with it effectively, only rows where ProsperRating (Alpha) is not null will be selected.
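One way to express this filter in pandas is sketched below. The toy values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; one listing has no Prosper rating.
toy = pd.DataFrame({
    "ListingNumber": [1, 2, 3, 4],
    "ProsperRating (Alpha)": ["AA", np.nan, "HR", "C"],
})

# Keep only rows with a rating; .copy() avoids chained-assignment warnings later.
rated = toy[toy["ProsperRating (Alpha)"].notna()].copy()
print(len(toy), "->", len(rated))
```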

There are still missing values for "Occupation" & "DebtToIncomeRatio".

To deal with these missing values, replace the null values in "Occupation" with "Unknown".

For "DebtToIncomeRatio", replace the null values with the column mean.
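The two imputations can be sketched as follows, assuming a pandas DataFrame. The toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and one gap in each column.
toy = pd.DataFrame({
    "Occupation": ["Teacher", np.nan, "Nurse"],
    "DebtToIncomeRatio": [0.2, np.nan, 0.4],
})

# Categorical gap: use an explicit placeholder label.
toy["Occupation"] = toy["Occupation"].fillna("Unknown")

# Numeric gap: use the mean of the observed values.
dti_mean = toy["DebtToIncomeRatio"].mean()
toy["DebtToIncomeRatio"] = toy["DebtToIncomeRatio"].fillna(dti_mean)
print(toy)
```

Mean imputation keeps the column average unchanged but places every imputed row at the same value, which is worth keeping in mind when reading the histogram of this variable.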

The data types of "TotalInquiries" & "TotalTrades" need to be changed from float to int.
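A short sketch of the conversion, assuming no null values remain in either column (a plain integer cast fails on NaN). The toy values are illustrative.

```python
import pandas as pd

# Toy frame with made-up counts stored as floats.
toy = pd.DataFrame({
    "TotalInquiries": [3.0, 0.0, 7.0],
    "TotalTrades": [12.0, 25.0, 4.0],
})

# Cast both count columns to 64-bit integers in one call.
toy = toy.astype({"TotalInquiries": "int64", "TotalTrades": "int64"})
print(toy.dtypes)
```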

The "ListingCreationDate" column also needs to be split into "Day", "Month", and "Year" columns.

Change the values for 'Month' from numbers to names.
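The date split and the month renaming can be sketched with the pandas datetime accessor. The timestamps below are made up, and the sketch assumes the column parses cleanly with `pd.to_datetime`.

```python
import pandas as pd

# Toy frame with made-up listing timestamps stored as strings.
toy = pd.DataFrame({
    "ListingCreationDate": [
        "2013-01-15 09:30:00",
        "2012-10-03 17:05:12",
        "2013-12-24 08:00:00",
    ]
})

created = pd.to_datetime(toy["ListingCreationDate"])
toy["Day"] = created.dt.day
toy["Month"] = created.dt.month_name()  # month names instead of numbers
toy["Year"] = created.dt.year
print(toy[["Day", "Month", "Year"]])
```

Using `month_name()` directly avoids a separate number-to-name mapping step.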

The "ListingCategory (numeric)" needs to be adjusted and renamed.
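One possible way to make this adjustment is to map the numeric codes to the labels listed in the variable description. The new column name "ListingCategory" is an assumption for illustration; the notebook may use a different name.

```python
import pandas as pd

# Code-to-label mapping taken from the variable description above.
LISTING_CATEGORIES = {
    0: "Not Available", 1: "Debt Consolidation", 2: "Home Improvement",
    3: "Business", 4: "Personal Loan", 5: "Student Use", 6: "Auto",
    7: "Other", 8: "Baby&Adoption", 9: "Boat", 10: "Cosmetic Procedure",
    11: "Engagement Ring", 12: "Green Loans", 13: "Household Expenses",
    14: "Large Purchases", 15: "Medical/Dental", 16: "Motorcycle",
    17: "RV", 18: "Taxes", 19: "Vacation", 20: "Wedding Loans",
}

# Toy frame with made-up category codes.
toy = pd.DataFrame({"ListingCategory (numeric)": [1, 7, 0, 20]})

# Replace the codes with readable labels under a new (hypothetical) column name.
toy["ListingCategory"] = toy["ListingCategory (numeric)"].map(LISTING_CATEGORIES)
toy = toy.drop(columns="ListingCategory (numeric)")
print(toy)
```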

What is the structure of your dataset?

The working dataset contains 83982 loans with 21 features. It was filtered from an original dataset of 113937 loans with 81 features.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are the Borrower's Annual Percentage Rate (APR), Borrower Rate, and Prosper Rating.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features I expect to support the investigation are Occupation, Term, Employment Status, and Debt To Income Ratio.

Univariate Exploration

The distribution for the Year indicates that most of the loans were listed in 2013. I don't have enough information to explain why 2013 saw such a large number of loans.

The distribution for the Month shows that January, October, & December are the three most active months for loans. These months fall either toward the end of a year or at the beginning of a new one. One possible explanation is that people take out loans in those periods to prepare for the festive season and for what the new year holds in terms of personal projects, school fees, & other expenses.

The distribution of the Borrower's Annual Percentage Rate (APR) is multimodal, with about four peaks. The first is between 0.08 & 0.09, the second between 0.2 & 0.22, and the third between 0.28 & 0.29. The last one, with the highest count, is between 0.34 & 0.36. The distribution mainly spans the range from 0.1 to 0.4.

The distribution for the Borrower's Rate is almost identical to that of the Borrower's Annual Percentage Rate (APR). They share a similar pattern of multiple modes within the same range.

The distribution of the Stated Monthly Income is skewed to the right, with most values between 0 and 15,000. This is the income range for most of the borrowers.

The distribution shows that the Debt to Income Ratio is skewed to the right, with most values between 0 & 1. It also has a peak between 0.25 & 0.29, which means the most common ratio among borrowers falls in that range.

In the distribution above, we see multiple modes. The most common loan amount is between 4,000 & 6,000. The multiple modes suggest that people tend to borrow in multiples of 5,000.

From the pie chart, we see that almost 80% of borrowers are employed, which suggests that being employed is significant to getting a loan.

From the above, we see that most of the people fall within ratings E to A. The remaining percentages are either people with the lowest credit risk (AA) or the highest credit risk (HR).

From the chart, most people borrow because they want to consolidate debts. Borrowing more to consolidate debts doesn't sound healthy.

From the chart, the largest group of people did not state their occupation. The second largest group is professionals.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The distribution for the Borrower's Rate is almost identical to that of the Borrower's Annual Percentage Rate (APR). They share a similar pattern of multiple modes within the same range.

The distributions for the Debt to Income Ratio & Stated Monthly Income were skewed to the right.

80% of borrowers are employed. It suggests that employment status is one of the criteria for loan allocation.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were no unusual distributions, and I did not perform any further operations to tidy, adjust, or change the form of the data.

Bivariate Exploration

From the heatmap and the grid plot, we see that only a few variables are strongly correlated. Examples of strongly correlated pairs are Borrower APR & Borrower Rate, and Loan Original Amount & Monthly Loan Payment.
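The coefficients behind a heatmap like this can be sketched with a pandas correlation matrix. The values below are invented purely to illustrate the call and are not results from the Prosper data.

```python
import pandas as pd

# Toy frame with made-up values for four numeric variables.
toy = pd.DataFrame({
    "BorrowerRate": [0.10, 0.15, 0.20, 0.25, 0.30],
    "BorrowerAPR": [0.12, 0.18, 0.22, 0.28, 0.33],
    "LoanOriginalAmount": [2000, 15000, 4000, 10000, 25000],
    "MonthlyLoanPayment": [70, 480, 140, 330, 800],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()

rate_apr = corr.loc["BorrowerRate", "BorrowerAPR"]
amount_payment = corr.loc["LoanOriginalAmount", "MonthlyLoanPayment"]
print(corr.round(2))
```

The resulting matrix can be passed to a plotting function such as a seaborn heatmap to produce the figure.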

The plot above further confirms that the distributions of Borrower Rate & Borrower APR are similar to each other.

We see a clear relationship between the Borrower APR & the Prosper Rating: the better the Prosper Rating, the lower the Borrower's APR.
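Because the Prosper Rating is an ordered category, one way to check this trend numerically is to compare the median APR per rating. The sketch below uses made-up values and assumes the rating order HR, E, D, C, B, A, AA from highest to lowest credit risk.

```python
import pandas as pd

# Assumed rating order, from highest credit risk (HR) to lowest (AA).
RATING_ORDER = ["HR", "E", "D", "C", "B", "A", "AA"]

# Toy frame with made-up ratings and APR values.
toy = pd.DataFrame({
    "ProsperRating (Alpha)": ["HR", "AA", "C", "HR", "A", "C", "AA", "E"],
    "BorrowerAPR": [0.36, 0.08, 0.21, 0.34, 0.12, 0.23, 0.09, 0.31],
})

# An ordered categorical makes groupby and plots follow the rating order.
toy["ProsperRating (Alpha)"] = pd.Categorical(
    toy["ProsperRating (Alpha)"], categories=RATING_ORDER, ordered=True
)

# observed=True keeps only ratings that actually appear in the data.
median_apr = toy.groupby("ProsperRating (Alpha)", observed=True)["BorrowerAPR"].median()
print(median_apr)
```

Declaring the order once also keeps box plots and bar charts of the rating sorted from worst to best rather than alphabetically.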

There's a positive relationship between the Original Loan Amount and the term of the loan. The longer the term, the higher the amount borrowed.

The above image shows a clear relationship between the Original Loan Amount & the employment status. Employed borrowers receive larger original loan amounts. This supports the earlier assertion that employment status is a key criterion for loan disbursement.

From the interactions above, we see that most people take loans with terms of three years or longer. Also, those under the "HR" rating only took three-year loans.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

A further comparison of Borrower APR & Borrower Rate confirmed the similarity of their distributions. The Borrower APR & Borrower Rate are strongly correlated.

There's a clear relationship between the Borrower APR & the Prosper Rating: the better the Prosper Rating, the lower the Borrower's APR.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Most people take loans with terms of three years or longer. Also, those under the "HR" rating only took three-year loans.

There's a clear relationship between the Original Loan Amount & the employment status. Employed borrowers receive larger original loan amounts. This supports the earlier assertion that employment status is a key criterion for loan disbursement.

There's a positive relationship between the Original Loan Amount and the term of the loan. The longer the term, the higher the amount borrowed.

There's a strong correlation between Loan Original Amount & Monthly Loan Payment.

Multivariate Exploration

Month does not appear to have an impact on the Loan Amount & Borrower APR

Year also does not appear to have any major impact on the Loan Amount & Borrower APR

There's an increase in loan amount as the rating gets better. There's also reduced Borrower APR as the rating gets better.

We can see that with a better Prosper rating, the loan amount increases for all three terms, and the gap in loan amount between terms also widens.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There's an increase in loan amount as the rating gets better. There's also reduced Borrower APR as the rating gets better.

Month does not appear to have an impact on the Loan Amount & Borrower APR

Year also does not appear to have any major impact on the Loan Amount & Borrower APR

Were there any interesting or surprising interactions between features?

With a better Prosper rating, the loan amount increases for all three terms, and the gap in loan amount between terms also widens.

Conclusions

The distribution for the Borrower's Rate is almost identical to that of the Borrower's Annual Percentage Rate (APR). They share a similar pattern of multiple modes within the same range. Both variables are strongly correlated.

Most people take loans with terms of three years or longer. Also, those under the "HR" rating only took three-year loans.

There's a clear relationship between the Original Loan Amount & the employment status. Employed borrowers receive larger original loan amounts. This supports the earlier assertion that employment status is a key criterion for loan disbursement.

Almost 80% of borrowers are employed, which suggests that being employed is significant to getting a loan.

With a better Prosper rating, the loan amount increases for all three terms, and the gap in loan amount between terms also widens.